Packages for today:
library(tidyverse)
library(ggbeeswarm)
library(janitor)
library(glmmTMB)
library(car)
library(emmeans)
library(DHARMa)
library(DiagrammeR) # this one is not necessary

Goals for today:
A Generalized Linear Mixed Model (GLMM) is a GLM with random effects. We have seen that we need GLMs to model response variables corresponding to presence/absence or count data, and that we use random effects to account for structure (such as spatial or temporal structure) in the unexplained variation of a response variable. Now we put GLMs and random effects together in the same model.
You should already be familiar with most features of GLMs and random effects. Today we will mostly focus on the special, counter-intuitive aspects of GLMMs.
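To make the combination concrete, here is a minimal sketch (with made-up names and values, not today's data) of how data arise under a binomial GLMM with a random intercept:

```r
# Toy simulation of a binomial GLMM: logit-linear predictor + random intercepts
set.seed(42)
n_groups <- 10                                     # e.g. days
n_per_group <- 20
group <- rep(1:n_groups, each = n_per_group)
group_effect <- rnorm(n_groups, mean = 0, sd = 2)  # random intercepts, Gaussian on the logit scale
x <- runif(n_groups * n_per_group, 0, 30)          # a continuous predictor, e.g. temperature
eta <- -2 + 0.2 * x + group_effect[group]          # linear predictor on the logit scale
y <- rbinom(n_groups * n_per_group, size = 1, prob = plogis(eta))  # Bernoulli response
table(y)
```

The three ingredients are exactly those of today's models: a fixed part (intercept and slope), a Gaussian random part on the link scale, and a Bernoulli draw on the response scale.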
The presence of a rare beetle has been monitored across 24 locations where it is known to occur, on 22 consecutive days, in order to establish a standardized monitoring protocol before looking for the beetle in locations where it has not been detected yet. We suspect the beetle is more likely to be detected at higher temperatures because of increased activity. We want to test the effect of temperature on the probability of detecting the species, and check whether temperature has a consistent effect across days and locations.
We can visualise the effect of temperatures overall:
beetles <- read_csv("beetle_detection.csv")## Parsed with column specification:
## cols(
## temperature = col_double(),
## presence = col_double(),
## site = col_character(),
## day = col_double()
## )
str(beetles)## tibble [528 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ temperature: num [1:528] 7.91 15.03 18.91 2.44 15.76 ...
## $ presence : num [1:528] 0 0 1 0 0 0 0 0 0 0 ...
## $ site : chr [1:528] "A" "B" "C" "D" ...
## $ day : num [1:528] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. temperature = col_double(),
## .. presence = col_double(),
## .. site = col_character(),
## .. day = col_double()
## .. )
beetles %>% mutate(site_fct=factor(site)) %>% summary()## temperature presence site day
## Min. :-2.601 Min. :0.0000 Length:528 Min. : 1.00
## 1st Qu.:10.450 1st Qu.:0.0000 Class :character 1st Qu.: 6.00
## Median :13.621 Median :1.0000 Mode :character Median :11.00
## Mean :13.705 Mean :0.5833 Mean :11.59
## 3rd Qu.:16.873 3rd Qu.:1.0000 3rd Qu.:18.00
## Max. :29.040 Max. :1.0000 Max. :22.00
##
## site_fct
## A : 22
## B : 22
## C : 22
## D : 22
## E : 22
## F : 22
## (Other):396
ggplot(beetles, aes(x=temperature, y = presence, color=site)) + geom_beeswarm(groupOnX = FALSE)

And the variation in mean detection across locations, and across days:
beetles %>% group_by(site) %>%
summarise(prob = mean(presence),
SE=sd(presence)/sqrt(length(presence)-1), lowCI=prob-1.96*SE,
upCI=prob+1.96*SE) %>%
ggplot(aes(x=site, y=prob)) + geom_point() + geom_errorbar(aes(x=site, ymin=lowCI, ymax=upCI)) +
ylim(0,1)## `summarise()` ungrouping output (override with `.groups` argument)
beetles %>% group_by(day) %>%
summarise(prob = mean(presence),
SE=sd(presence)/sqrt(length(presence)-1), lowCI=prob-1.96*SE,
upCI=prob+1.96*SE) %>%
ggplot(aes(x=day, y=prob)) + geom_point() + geom_errorbar(aes(x=day, ymin=lowCI, ymax=upCI)) +
ylim(-0.2,1.2)## `summarise()` ungrouping output (override with `.groups` argument)
It looks like there is more variation among days than among locations, but there could be some clustering in both; we should definitely account for both in our models. Notice how the approximate confidence intervals we drew for days are wrong: two of them extend above 1, although 1 is the logical maximum value.
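A quick way to keep such intervals inside [0, 1] (a side note, not what we will use below; the GLMM handles the problem properly) is to build the Wald interval on the logit scale and back-transform it:

```r
# Approximate 95% CI for a proportion, built on the logit scale so that the
# back-transformed interval always stays within [0, 1] (requires 0 < p < 1)
logit_ci <- function(x) {
  n <- length(x)
  p <- mean(x)
  se_logit <- sqrt(1 / (n * p * (1 - p)))  # delta-method SE of qlogis(p)
  plogis(qlogis(p) + c(low = -1.96, est = 0, up = 1.96) * se_logit)
}
logit_ci(c(1, 1, 1, 1, 0, 1, 1, 1, 1, 1))  # stays below 1, unlike p + 1.96*SE
```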
Most sites were monitored once a day; but sometimes sites were missed, or monitored twice on the same day. That may reflect problems with logistics.
tabyl(beetles, var1 = day, var2 = site)## day A B C D E F G H I J K L M N O P Q R S T U V W X
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1
## 7 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
## 8 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1
## 10 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
## 11 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 12 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 13 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1
## 15 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## 16 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1
## 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
There are probably not enough measurements to estimate site-by-day interactions. We will not be able to model space-time interactions, although we must keep in mind that some interesting variation could exist there. We simply cannot know with these data.
The most basic model we could fit to answer our question is:
m_beetle_t <- glmmTMB(data = beetles, formula = presence ~ 1 + temperature, family = "binomial")
summary(m_beetle_t)## Family: binomial ( logit )
## Formula: presence ~ 1 + temperature
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## 663.2 671.7 -329.6 659.2 526
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.66420 0.29634 -5.616 1.96e-08 ***
## temperature 0.14900 0.02132 6.988 2.80e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m_beetle_t, type=3)## Analysis of Deviance Table (Type III Wald chisquare tests)
##
## Response: presence
## Chisq Df Pr(>Chisq)
## (Intercept) 31.537 1 1.957e-08 ***
## temperature 48.826 1 2.797e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = red]
presence; temp
node [shape = circle, color = black]
p; y
node [shape = diamond, color=green]
intercept;
y -> p [label = ' plogis'];
p -> presence [label = ' Bernoulli', style=dashed];
intercept ->y; temp -> slope -> y ;
}")

Model diagnostics are about right:
testResiduals(m_beetle_t)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.029792, p-value = 0.7368
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.99977, p-value = 0.992
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 0, simulations = 528, p-value = 0.02557
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.000000000 0.006962165
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0
The only significant test is the outlier test, but we observed zero outliers (fewer than expected by chance), so there is nothing to worry about there.
However, that model is not correct conceptually. We know data were collected on different days and different locations, and we saw apparent variation among days and locations. We must account for days and locations before we can trust any result.
We could do this using fixed effects:
m_beetle_t_s_d <- glmmTMB(data = beetles, formula = presence ~ 1 + temperature + site + as.factor(day),
family = "binomial")
summary(m_beetle_t_s_d)## Family: binomial ( logit )
## Formula: presence ~ 1 + temperature + site + as.factor(day)
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## 408.3 604.7 -158.1 316.3 482
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.638e+00 1.037e+00 -5.435 5.49e-08 ***
## temperature 3.069e-01 4.092e-02 7.500 6.36e-14 ***
## siteB -3.049e-01 9.929e-01 -0.307 0.758765
## siteC 9.431e-01 1.001e+00 0.942 0.345950
## siteD 2.274e+00 1.153e+00 1.973 0.048522 *
## siteE -3.838e-01 9.206e-01 -0.417 0.676712
## siteF 5.618e-01 9.012e-01 0.623 0.533046
## siteG 9.941e-01 1.025e+00 0.970 0.332156
## siteH 2.493e-01 9.566e-01 0.261 0.794351
## siteI 6.353e-01 9.545e-01 0.666 0.505710
## siteJ 8.960e-01 9.773e-01 0.917 0.359284
## siteK 2.539e-01 1.008e+00 0.252 0.801120
## siteL -4.761e-01 1.001e+00 -0.476 0.634260
## siteM -7.680e-01 1.083e+00 -0.709 0.478397
## siteN -4.796e-01 9.602e-01 -0.499 0.617471
## siteO -5.048e-01 9.312e-01 -0.542 0.587773
## siteP -5.336e-01 9.452e-01 -0.565 0.572407
## siteQ 2.550e-01 9.375e-01 0.272 0.785657
## siteR 2.714e-01 1.046e+00 0.260 0.795165
## siteS 5.607e-01 9.983e-01 0.562 0.574368
## siteT -1.427e-01 9.543e-01 -0.150 0.881105
## siteU 3.543e-01 9.505e-01 0.373 0.709314
## siteV -2.734e-01 9.928e-01 -0.275 0.783049
## siteW -3.250e-01 9.662e-01 -0.336 0.736606
## siteX -1.147e+00 1.002e+00 -1.144 0.252581
## as.factor(day)2 1.142e+00 8.163e-01 1.399 0.161878
## as.factor(day)3 2.208e+00 8.249e-01 2.676 0.007444 **
## as.factor(day)4 5.155e+00 1.229e+00 4.193 2.75e-05 ***
## as.factor(day)5 3.167e+00 8.058e-01 3.931 8.46e-05 ***
## as.factor(day)6 9.492e-01 8.109e-01 1.171 0.241788
## as.factor(day)7 4.322e+00 9.847e-01 4.389 1.14e-05 ***
## as.factor(day)8 2.608e+00 7.907e-01 3.298 0.000974 ***
## as.factor(day)9 3.198e-01 7.707e-01 0.415 0.678169
## as.factor(day)10 2.800e+01 5.503e+04 0.001 0.999594
## as.factor(day)11 2.682e+01 3.793e+04 0.001 0.999436
## as.factor(day)12 8.472e-01 8.288e-01 1.022 0.306727
## as.factor(day)13 2.631e+01 3.679e+04 0.001 0.999430
## as.factor(day)14 -7.080e-01 9.019e-01 -0.785 0.432420
## as.factor(day)15 1.332e+00 9.314e-01 1.430 0.152815
## as.factor(day)16 2.596e+01 1.699e+04 0.002 0.998781
## as.factor(day)17 2.672e+00 8.649e-01 3.090 0.002004 **
## as.factor(day)18 1.913e+00 7.585e-01 2.523 0.011648 *
## as.factor(day)19 -2.248e+01 1.114e+04 -0.002 0.998390
## as.factor(day)20 3.264e-01 8.212e-01 0.398 0.690991
## as.factor(day)21 -2.968e+00 1.127e+00 -2.633 0.008466 **
## as.factor(day)22 2.261e+00 7.585e-01 2.981 0.002875 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m_beetle_t_s_d, type=3)## Analysis of Deviance Table (Type III Wald chisquare tests)
##
## Response: presence
## Chisq Df Pr(>Chisq)
## (Intercept) 29.536 1 5.490e-08 ***
## temperature 56.257 1 6.359e-14 ***
## site 18.678 23 0.7198
## as.factor(day) 79.573 21 9.518e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That is okay, but it is not efficient in several ways: we estimate many parameters we do not care about, and for days where the beetle was always (or never) detected the estimates are meaningless (look at the huge estimates and standard errors for days 10, 11, 13, 16 and 19: this is complete separation).
Demonstration: whether a logit-scale prediction is 28 or 100000, you get the same data:
rbinom(n = 1000, size = 1, prob = plogis(28))## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## (output truncated: all 1000 draws are 1)
rbinom(n = 1000, size = 1, prob = plogis(100000))## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## (output truncated: all 1000 draws are 1)
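Those huge day estimates are a case of complete separation. A toy example (hypothetical data, using base-R glm() for simplicity) shows how a group containing only 1s sends its logit-scale coefficient off towards infinity:

```r
# Complete separation: group "b" contains only 1s, so the likelihood keeps
# increasing as its coefficient grows; glm() stops at an arbitrary large value
y <- c(0, 1, 0, 1, 1, 1, 1, 1)
g <- c("a", "a", "a", "a", "b", "b", "b", "b")
fit <- suppressWarnings(glm(y ~ g, family = binomial))  # warns: fitted probabilities of 0 or 1
coef(fit)[["gb"]]  # a huge, meaningless estimate with a huge standard error
```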
Things get even worse if we try to estimate a temperature-by-day interaction with fixed effects: the model fails to converge.
m_beetle_tXd_s <- glmmTMB(data = beetles, formula = presence ~ 1 + temperature*as.factor(day) + site,
family = "binomial")## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
summary(m_beetle_tXd_s)## Family: binomial ( logit )
## Formula: presence ~ 1 + temperature * as.factor(day) + site
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## NA NA NA NA 461
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.649e+00 NA NA NA
## temperature 5.479e-01 NA NA NA
## as.factor(day)2 3.624e+00 NA NA NA
## as.factor(day)3 6.479e+00 NA NA NA
## as.factor(day)4 -2.896e+03 NA NA NA
## as.factor(day)5 3.269e+00 NA NA NA
## as.factor(day)6 1.030e+00 NA NA NA
## as.factor(day)7 1.146e+01 NA NA NA
## as.factor(day)8 -1.407e+00 NA NA NA
## as.factor(day)9 1.869e+00 NA NA NA
## as.factor(day)10 2.349e+03 NA NA NA
## as.factor(day)11 4.896e+02 NA NA NA
## as.factor(day)12 2.422e+00 NA NA NA
## as.factor(day)13 3.396e+02 NA NA NA
## as.factor(day)14 7.758e-01 NA NA NA
## as.factor(day)15 2.121e+00 NA NA NA
## as.factor(day)16 1.576e+03 NA NA NA
## as.factor(day)17 8.591e+00 NA NA NA
## as.factor(day)18 6.852e+00 NA NA NA
## as.factor(day)19 -1.422e+01 NA NA NA
## as.factor(day)20 4.613e+00 NA NA NA
## as.factor(day)21 7.594e-01 NA NA NA
## as.factor(day)22 7.275e+00 NA NA NA
## siteB -6.374e-01 NA NA NA
## siteC 2.842e-01 NA NA NA
## siteD 2.065e+00 NA NA NA
## siteE -7.897e-01 NA NA NA
## siteF 2.460e-01 NA NA NA
## siteG 6.209e-01 NA NA NA
## siteH -1.252e-01 NA NA NA
## siteI 5.858e-01 NA NA NA
## siteJ 5.071e-01 NA NA NA
## siteK -1.377e-01 NA NA NA
## siteL -1.063e+00 NA NA NA
## siteM -1.153e+00 NA NA NA
## siteN -1.096e+00 NA NA NA
## siteO -1.134e+00 NA NA NA
## siteP -9.464e-01 NA NA NA
## siteQ -4.196e-02 NA NA NA
## siteR -1.324e-01 NA NA NA
## siteS 2.070e-01 NA NA NA
## siteT -5.242e-01 NA NA NA
## siteU 4.429e-02 NA NA NA
## siteV -1.028e+00 NA NA NA
## siteW -5.191e-01 NA NA NA
## siteX -1.499e+00 NA NA NA
## temperature:as.factor(day)2 -1.607e-01 NA NA NA
## temperature:as.factor(day)3 -3.169e-01 NA NA NA
## temperature:as.factor(day)4 4.107e+02 NA NA NA
## temperature:as.factor(day)5 4.162e-02 NA NA NA
## temperature:as.factor(day)6 1.288e-03 NA NA NA
## temperature:as.factor(day)7 -5.244e-01 NA NA NA
## temperature:as.factor(day)8 4.014e-01 NA NA NA
## temperature:as.factor(day)9 -1.255e-01 NA NA NA
## temperature:as.factor(day)10 4.917e+01 NA NA NA
## temperature:as.factor(day)11 4.128e+02 NA NA NA
## temperature:as.factor(day)12 -1.177e-01 NA NA NA
## temperature:as.factor(day)13 1.952e+02 NA NA NA
## temperature:as.factor(day)14 -1.234e-01 NA NA NA
## temperature:as.factor(day)15 -4.775e-02 NA NA NA
## temperature:as.factor(day)16 2.476e+02 NA NA NA
## temperature:as.factor(day)17 -4.601e-01 NA NA NA
## temperature:as.factor(day)18 -3.623e-01 NA NA NA
## temperature:as.factor(day)19 -2.151e+00 NA NA NA
## temperature:as.factor(day)20 -3.002e-01 NA NA NA
## temperature:as.factor(day)21 -2.596e-01 NA NA NA
## temperature:as.factor(day)22 -3.649e-01 NA NA NA
So, it is often more convenient and efficient to use random effects for categorical variables for which we do not care too much about the exact parameter estimates.
To fit a random effect in lme4 or glmmTMB, you write + (1 | group) for a random intercept of the grouping variable group.
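Our opening question also asked whether temperature has a consistent effect across days; a random slope would let the temperature effect itself vary by day. As a sketch of the syntax (these formulas are illustrative and are not fitted below):

```r
# Random intercept per group (what we fit below):
f_intercept <- presence ~ 1 + temperature + (1 | day) + (1 | site)
# Random intercept AND random temperature slope per day (correlated by default):
f_slope <- presence ~ 1 + temperature + (1 + temperature | day) + (1 | site)
# Same, but forcing the intercept and slope to be uncorrelated:
f_uncor <- presence ~ 1 + temperature + (1 | day) + (0 + temperature | day) + (1 | site)
```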
m_beetle_t_s_d <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature + (1|day) + (1|site), family = "binomial")
summary(m_beetle_t_s_d)## Family: binomial ( logit )
## Formula: presence ~ 1 + temperature + (1 | day) + (1 | site)
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## 433.4 450.5 -212.7 425.4 524
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## day (Intercept) 8.470e+00 2.910e+00
## site (Intercept) 4.127e-09 6.424e-05
## Number of obs: 528, groups: day, 22; site, 24
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.88933 0.79076 -3.654 0.000258 ***
## temperature 0.27667 0.03523 7.854 4.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That is not only much cleaner to look at, but also computationally more efficient.
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = red]
presence; temp
node [shape = circle, color = black]
p; y; baseline; noise
node [shape = diamond, color=green]
intercept;
y -> p [label = ' plogis'];
p -> presence [label = ' Bernoulli', style=dashed];
intercept -> baseline ; noise -> baseline;
baseline -> y;
temp -> slope -> y ;
V_site -> noise [label = ' Gaussian', style=dashed];
V_day -> noise [label = ' Gaussian', style=dashed];
}")

One small annoyance: the Anova() function tests only fixed effects.
Anova(m_beetle_t_s_d, type = 3)## Analysis of Deviance Table (Type III Wald chisquare tests)
##
## Response: presence
## Chisq Df Pr(>Chisq)
## (Intercept) 13.351 1 0.0002583 ***
## temperature 61.682 1 4.036e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To test random effects you need to compare nested models with/without a given random effect using anova() (lower case!).
m_beetle_t_s <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature+ (1|site), family = "binomial")
m_beetle_t_d <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature + (1|day), family = "binomial")

Test for the random intercept of site:
anova(m_beetle_t_d, m_beetle_t_s_d)## Data: beetles
## Models:
## m_beetle_t_d: presence ~ 1 + temperature + (1 | day), zi=~0, disp=~1
## m_beetle_t_s_d: presence ~ 1 + temperature + (1 | day) + (1 | site), zi=~0, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m_beetle_t_d 3 431.45 444.25 -212.72 425.45
## m_beetle_t_s_d 4 433.45 450.52 -212.72 425.45 0 1 1
No evidence for variance in the intercept due to site differences. But it does not really hurt to keep the effect of site in the model.
Test for the random intercept of day:
anova(m_beetle_t_s, m_beetle_t_s_d)## Data: beetles
## Models:
## m_beetle_t_s: presence ~ 1 + temperature + (1 | site), zi=~0, disp=~1
## m_beetle_t_s_d: presence ~ 1 + temperature + (1 | day) + (1 | site), zi=~0, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m_beetle_t_s 3 665.20 678.01 -329.60 659.20
## m_beetle_t_s_d 4 433.45 450.52 -212.72 425.45 233.75 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Strong evidence for variance in the intercept due to day differences.
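A technical aside: because a variance cannot be negative, the null hypothesis (variance = 0) sits on the boundary of the parameter space, so the chi-square p-value reported by anova() is conservative. A common rough correction (not needed here, given how extreme the statistic is) is to halve it:

```r
# Likelihood-ratio test of a single variance component: the null distribution
# is a 50:50 mixture of chi-square(df = 0) and chi-square(df = 1), so the
# naive chi-square(df = 1) p-value can simply be halved
chisq_obs <- 233.75  # Chisq for the day random intercept, from the anova() table above
p_naive <- pchisq(chisq_obs, df = 1, lower.tail = FALSE)
p_boundary <- p_naive / 2  # boundary-corrected p-value
c(naive = p_naive, boundary_corrected = p_boundary)
```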
Let's try to visualise this model:
emm_m_beetle_t_s_d <- as_tibble(emmeans(object = m_beetle_t_s_d, specs = ~ temperature,
at= list(temperature=seq(min(beetles$temperature), max(beetles$temperature), length.out = 100)),
type="response"))

ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob), col="red") +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
fill="red", alpha=0.1) + facet_wrap(~day)

But what exactly are we seeing? This is a prediction for a typical day, that is, a day for which the random effect is zero.
emmeans() does not give us fine control over these predictions; predict() may be better here.
Let's consider predictions for two different days: day 19, when we did not detect the beetle at all, and day 10, when we detected it at all sites.
newdata_day19 <- tibble(temperature = seq(min(beetles$temperature), max(beetles$temperature), length.out = 100),
day=19, site="A")
newdata_day10 <- tibble(temperature = seq(min(beetles$temperature), max(beetles$temperature), length.out = 100),
day=10, site="A")

We can set the random effects to zero, to obtain predictions for a median day:
predict(m_beetle_t_s_d, re.form = ~ 0, newdata = newdata_day19, type = "response")## [1] 0.02636552 0.02873309 0.03130644 0.03410216 0.03713797 0.04043273
## [7] 0.04400642 0.04788022 0.05207644 0.05661855 0.06153111 0.06683972
## [13] 0.07257090 0.07875204 0.08541116 0.09257678 0.10027765 0.10854248
## [19] 0.11739961 0.12687662 0.13699991 0.14779420 0.15928200 0.17148305
## [25] 0.18441370 0.19808627 0.21250840 0.22768244 0.24360480 0.26026541
## [31] 0.27764723 0.29572581 0.31446908 0.33383720 0.35378260 0.37425017
## [37] 0.39517775 0.41649666 0.43813254 0.46000630 0.48203522 0.50413420
## [43] 0.52621703 0.54819778 0.56999208 0.59151843 0.61269939 0.63346262
## [49] 0.65374180 0.67347733 0.69261690 0.71111580 0.72893711 0.74605163
## [55] 0.76243773 0.77808102 0.79297391 0.80711511 0.82050902 0.83316511
## [61] 0.84509728 0.85632321 0.86686376 0.87674233 0.88598432 0.89461664
## [67] 0.90266719 0.91016449 0.91713730 0.92361433 0.92962392 0.93519388
## [73] 0.94035129 0.94512235 0.94953228 0.95360522 0.95736421 0.96083116
## [79] 0.96402678 0.96697065 0.96968118 0.97217568 0.97447034 0.97658033
## [85] 0.97851976 0.98030183 0.98193879 0.98344200 0.98482204 0.98608869
## [91] 0.98725100 0.98831734 0.98929546 0.99019251 0.99101506 0.99176920
## [97] 0.99246052 0.99309418 0.99367493 0.99420712
predict(m_beetle_t_s_d, re.form = ~ 0, newdata = newdata_day10, type = "response")## [1] 0.02636552 0.02873309 0.03130644 0.03410216 0.03713797 0.04043273
## [7] 0.04400642 0.04788022 0.05207644 0.05661855 0.06153111 0.06683972
## [13] 0.07257090 0.07875204 0.08541116 0.09257678 0.10027765 0.10854248
## [19] 0.11739961 0.12687662 0.13699991 0.14779420 0.15928200 0.17148305
## [25] 0.18441370 0.19808627 0.21250840 0.22768244 0.24360480 0.26026541
## [31] 0.27764723 0.29572581 0.31446908 0.33383720 0.35378260 0.37425017
## [37] 0.39517775 0.41649666 0.43813254 0.46000630 0.48203522 0.50413420
## [43] 0.52621703 0.54819778 0.56999208 0.59151843 0.61269939 0.63346262
## [49] 0.65374180 0.67347733 0.69261690 0.71111580 0.72893711 0.74605163
## [55] 0.76243773 0.77808102 0.79297391 0.80711511 0.82050902 0.83316511
## [61] 0.84509728 0.85632321 0.86686376 0.87674233 0.88598432 0.89461664
## [67] 0.90266719 0.91016449 0.91713730 0.92361433 0.92962392 0.93519388
## [73] 0.94035129 0.94512235 0.94953228 0.95360522 0.95736421 0.96083116
## [79] 0.96402678 0.96697065 0.96968118 0.97217568 0.97447034 0.97658033
## [85] 0.97851976 0.98030183 0.98193879 0.98344200 0.98482204 0.98608869
## [91] 0.98725100 0.98831734 0.98929546 0.99019251 0.99101506 0.99176920
## [97] 0.99246052 0.99309418 0.99367493 0.99420712
Or we can look at how predictions differ on different days:
predict(m_beetle_t_s_d, re.form = NULL, newdata = newdata_day19, type = "response")## [1] 0.0001027289 0.0001122256 0.0001226001 0.0001339336 0.0001463146
## [6] 0.0001598399 0.0001746153 0.0001907562 0.0002083888 0.0002276510
## [11] 0.0002486932 0.0002716798 0.0002967904 0.0003242212 0.0003541864
## [16] 0.0003869199 0.0004226774 0.0004617378 0.0005044062 0.0005510152
## [21] 0.0006019285 0.0006575431 0.0007182924 0.0007846498 0.0008571322
## [26] 0.0009363040 0.0010227812 0.0011172365 0.0012204043 0.0013330860
## [31] 0.0014561567 0.0015905711 0.0017373715 0.0018976949 0.0020727821
## [36] 0.0022639868 0.0024727855 0.0027007888 0.0029497530 0.0032215930
## [41] 0.0035183966 0.0038424392 0.0041962002 0.0045823810 0.0050039239
## [46] 0.0054640326 0.0059661943 0.0065142041 0.0071121897 0.0077646397
## [51] 0.0084764324 0.0092528674 0.0100996989 0.0110231710 0.0120300550
## [56] 0.0131276896 0.0143240215 0.0156276492 0.0170478679 0.0185947160
## [61] 0.0202790226 0.0221124560 0.0241075719 0.0262778616 0.0286377987
## [66] 0.0312028832 0.0339896825 0.0370158672 0.0403002405 0.0438627586
## [71] 0.0477245404 0.0519078644 0.0564361484 0.0613339107 0.0666267076
## [76] 0.0723410452 0.0785042601 0.0851443658 0.0922898605 0.0999694928
## [81] 0.1082119808 0.1170456835 0.1264982208 0.1365960415 0.1473639414
## [86] 0.1588245328 0.1709976706 0.1838998435 0.1975435397 0.2119366015
## [91] 0.2270815856 0.2429751482 0.2596074777 0.2769617994 0.2950139763
## [96] 0.3137322328 0.3330770229 0.3530010646 0.3734495554 0.3943605786
predict(m_beetle_t_s_d, re.form = NULL, newdata = newdata_day10, type = "response")## [1] 0.6272756 0.6477068 0.6676114 0.6869349 0.7056305 0.7236587 0.7409880
## [8] 0.7575947 0.7734621 0.7885808 0.8029475 0.8165650 0.8294414 0.8415893
## [15] 0.8530252 0.8637692 0.8738440 0.8832744 0.8920869 0.9003091 0.9079696
## [22] 0.9150968 0.9217197 0.9278668 0.9335658 0.9388444 0.9437287 0.9482445
## [29] 0.9524162 0.9562671 0.9598195 0.9630945 0.9661120 0.9688907 0.9714483
## [36] 0.9738014 0.9759653 0.9779545 0.9797825 0.9814618 0.9830040 0.9844200
## [43] 0.9857197 0.9869124 0.9880067 0.9890106 0.9899313 0.9907755 0.9915496
## [50] 0.9922592 0.9929097 0.9935058 0.9940522 0.9945528 0.9950115 0.9954317
## [57] 0.9958167 0.9961694 0.9964925 0.9967883 0.9970594 0.9973075 0.9975349
## [64] 0.9977430 0.9979336 0.9981082 0.9982680 0.9984143 0.9985483 0.9986710
## [71] 0.9987834 0.9988862 0.9989804 0.9990666 0.9991455 0.9992178 0.9992839
## [78] 0.9993445 0.9993999 0.9994507 0.9994972 0.9995397 0.9995786 0.9996143
## [85] 0.9996469 0.9996768 0.9997041 0.9997292 0.9997521 0.9997731 0.9997923
## [92] 0.9998098 0.9998259 0.9998407 0.9998541 0.9998665 0.9998778 0.9998881
## [99] 0.9998976 0.9999063
Let's generalize that and make predictions for each day at the same time:
newdata_days1_10_19 <- expand_grid(temperature = seq(min(beetles$temperature), max(beetles$temperature), length.out = 100),
day=1:22, site="A")
newdata_days1_10_19$prob <- predict(m_beetle_t_s_d, re.form = NULL,
newdata = newdata_days1_10_19, type = "response")

ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob), col="red") +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
fill="red", alpha=0.1) +
geom_line(newdata_days1_10_19, mapping = aes(x=temperature, y=prob, col=as.factor(day))) + facet_wrap(~day)

If we want confidence intervals on those lines, we need to calculate them on the link scale and then apply the back-transformation:
newdata_days1_10_19 <- expand_grid(temperature = seq(min(beetles$temperature), max(beetles$temperature), length.out = 100),
day=1:22, site="A")
newdata_days1_10_19 <- bind_cols(newdata_days1_10_19, as_tibble(predict(m_beetle_t_s_d, re.form = NULL,
newdata = newdata_days1_10_19, type = "link", se.fit = TRUE) ) )
newdata_days1_10_19 <- newdata_days1_10_19 %>% mutate(prob = plogis(fit), lowCI = plogis(fit - 1.96*se.fit), upCI = plogis(fit + 1.96*se.fit))

ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob)) +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
alpha=0.1) +
geom_line(newdata_days1_10_19, mapping = aes(x=temperature, y=prob, col=as.factor(day))) +
geom_ribbon(newdata_days1_10_19, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI, fill=as.factor(day)), inherit.aes = FALSE, alpha=0.1) + facet_wrap(~day)

In black we see the median-day prediction (random effects set to zero); in color, the prediction for each individual day.
The predictions contrast strongly among days. For days 10 and 19 the model makes quite uncertain predictions; that is expected, because on those days we observed only 1s or only 0s, respectively. However, the random effect borrowed enough information from the other days to predict that the probability of presence should not be exactly 1 at low temperatures on day 10, nor exactly 0 at high temperatures on day 19.
One last thing we may want to see is the average response across all days.
newdata_alldays <- expand_grid(temperature = seq(min(beetles$temperature), max(beetles$temperature), length.out = 100),
day= unique(beetles$day), site="A")
newdata_alldays <- bind_cols(newdata_alldays, as_tibble(predict(m_beetle_t_s_d, re.form = NULL,
                                                newdata = newdata_alldays, type = "link", se.fit = TRUE)))
meanpred <- newdata_alldays %>% group_by(temperature) %>%
summarise(mean_prob = mean(plogis(fit)),
lowCI = mean(plogis(fit-1.96*se.fit)), upCI = mean(plogis(fit + 1.96*se.fit)))## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob)) +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
alpha=0.1) +
geom_line(newdata_days1_10_19, mapping = aes(x=temperature, y=prob, col=as.factor(day))) +
geom_ribbon(newdata_days1_10_19, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI, fill=as.factor(day)), inherit.aes = FALSE, alpha=0.1) +
geom_line(meanpred, mapping = aes(x=temperature, y=mean_prob), col = "goldenrod") +
  geom_ribbon(meanpred, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI), inherit.aes = FALSE, fill="goldenrod", alpha=0.3) + facet_wrap(~day)
In addition to the previous lines, we can now see in gold the prediction for the average presence across all days.
Finally, we could add a prediction line from a naive model that did not include random effects:
ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob)) +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
alpha=0.1) +
geom_line(meanpred, mapping = aes(x=temperature, y=mean_prob), col = "goldenrod", size=3) +
geom_ribbon(meanpred, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI), inherit.aes = FALSE, fill="goldenrod", alpha=0.3) +
geom_smooth(method="glm", method.args=list(family="binomial"))## `geom_smooth()` using formula 'y ~ x'
Here the naive prediction is quite similar to the mean prediction from the mixed-effects model, because the dataset is well balanced, but the confidence interval is very different. The difference could be far more dramatic in an unbalanced dataset.
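For reference, the naive model that geom_smooth() fits under the hood can also be written explicitly (a sketch, assuming the beetles data frame loaded above); it ignores the day and site structure entirely:

```r
# Naive GLM: treats all 528 observations as independent (pseudo-replication)
m_naive <- glm(presence ~ 1 + temperature, data = beetles, family = binomial)
summary(m_naive)
```

Its point estimates can be close to the mixed-model mean prediction here, but its standard errors are too optimistic because they ignore the day-to-day dependence.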
If we facet-wrap by day, geom_smooth() fits a separate model for each day. You will see that the models fitted on each day alone are much worse than the global model using all the information at once. In particular, on some days there is not enough information to fit a glm() at all, while the predictions of the model using all the data appear reasonable.
ggplot(beetles, aes(x=temperature, y=presence)) + geom_beeswarm(groupOnX = FALSE) +
geom_line(data = emm_m_beetle_t_s_d, mapping = aes(y=prob)) +
geom_ribbon(data = emm_m_beetle_t_s_d, mapping = aes(x=temperature, ymin = lower.CL, ymax= upper.CL),
inherit.aes = FALSE,
alpha=0.1) +
geom_line(newdata_days1_10_19, mapping = aes(x=temperature, y=prob, col=as.factor(day))) +
geom_ribbon(newdata_days1_10_19, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI, fill=as.factor(day)), inherit.aes = FALSE, alpha=0.1) +
geom_line(meanpred, mapping = aes(x=temperature, y=mean_prob), col = "goldenrod") +
geom_ribbon(meanpred, mapping = aes(x=temperature, ymin=lowCI, ymax=upCI), inherit.aes = FALSE, fill="goldenrod", alpha=0.3) +
geom_smooth(method="glm", method.args=list(family="binomial")) + facet_wrap(~day)## `geom_smooth()` using formula 'y ~ x'
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
OK, now you should start to see why GLMMs can be confusing and complicated. The differences between these lines can seem quite subtle, but they have different meanings, and in some cases they will tell very different stories.
In a linear mixed model there is no difference between the median prediction and the mean prediction, because linear models assume effects are additive.
In GLMs, effects are additive only on the link scale, not on the scale of the data.
The distortion due to the change of scale can be simulated. First, let's draw random numbers from a normal distribution with mean 2 and standard deviation 2.
x1 <- rnorm(n = 10000, mean = 2, sd = 2)
What is their mean?
mean(x1)## [1] 1.984147
(basically 2)
Now, let's apply a plogis() transform to these numbers; as if we were back-transforming predictions from a binary GLM:
px1 <- plogis(x1)
You could expect the mean of px1 to be plogis(2), i.e. plogis(mean(x1)), but...
plogis(mean(x1))## [1] 0.8791225
mean(px1)## [1] 0.7736642
Even more dramatic, we can apply an exp() transform, as if we were back-transforming predictions from a Poisson GLM:
lx1 <- exp(x1)
mean(lx1)## [1] 56.36405
exp(mean(x1))## [1] 7.272838
The difference between the mean of the transformed values and the transformed value of the mean is an instance of "Jensen's inequality", and it is the basic reason why you need to be careful when making predictions with a GLMM. Imagine you have a GLMM with a random effect of year: when you predict, are you after the response in the typical (median) year, or the response averaged across years?
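To connect this to mixed-model predictions, here is a minimal self-contained sketch (simulated year effects, not the beetle data) contrasting the "median year" prediction with the average across years:

```r
set.seed(42)
# Linear predictor: a fixed part (0.5) plus a Gaussian random year effect
year_effects <- rnorm(n = 10000, mean = 0, sd = 1.5)
eta <- 0.5 + year_effects

# Detection probability in the median year (random effect set to 0)
median_year_prob <- plogis(0.5)
# Detection probability averaged across years
mean_across_years <- mean(plogis(eta))

median_year_prob   # plogis(0.5), about 0.62
mean_across_years  # smaller: pulled towards 0.5, by Jensen's inequality
```

Both are legitimate predictions; they just answer different questions.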
We have answered the first part of our question: temperature increases the probability of beetle detection. But is temperature a reliable indicator? Do different days, or different sites have different responses to temperatures?
We already saw that fitting a fixed effect interaction between temperature and day did not work here; the model did not converge. What we need to do, is to fit random effects for the slope of temperature.
You know the 1 + we write at the beginning of a formula? It stands for "intercept", and that is why we write random-intercept models as (1 | group). If we want the effect of temperature to vary among groups, in addition to the random effect on the intercept, we can write (1 + temperature | group). This is known as a random-slope model:
m_beetle_tXd_s <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature + (1 + temperature | day) +
(1|site), family = "binomial")
summary(m_beetle_tXd_s)## Family: binomial ( logit )
## Formula:
## presence ~ 1 + temperature + (1 + temperature | day) + (1 | site)
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## 436.3 461.9 -212.1 424.3 522
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev. Corr
## day (Intercept) 1.355e+01 3.682e+00
## temperature 5.261e-03 7.253e-02 -0.83
## site (Intercept) 5.797e-09 7.614e-05
## Number of obs: 528, groups: day, 22; site, 24
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.88018 0.95394 -3.019 0.00253 **
## temperature 0.26860 0.04191 6.408 1.47e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m_beetle_tXd_s)## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: presence
## Chisq Df Pr(>Chisq)
## temperature 41.068 1 1.47e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = red]
presence; temp
node [shape = circle, color = black]
p; y; noise1; noise2; slopes; baselines;
node [shape = diamond, color=green]
intercept;
y -> p [label = ' plogis'];
p -> presence [label = ' Bernoulli', style=dashed];
intercept -> baselines ;
noise1 -> baselines;
baselines -> y;
slope -> slopes;
noise2 -> slopes;
temp -> slopes -> y ;
V_site -> noise1 [label = ' Gaussian', style=dashed];
V_day -> noise2 [label = ' Gaussian', style=dashed];
V_day -> noise1 [label = ' Gaussian', style=dashed];
}")
The Anova() returns only a test for the fixed effect of temperature. How do we test whether the effect of temperature varies among days? We can compare the model with the varying slope to a model without it, using the function anova() (lower case!).
m_beetle_t_d_s <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature + (1 | day) + (1|site),
family = "binomial")
anova(m_beetle_t_d_s, m_beetle_tXd_s)## Data: beetles
## Models:
## m_beetle_t_d_s: presence ~ 1 + temperature + (1 | day) + (1 | site), zi=~0, disp=~1
## m_beetle_tXd_s: presence ~ 1 + temperature + (1 + temperature | day) + (1 | site), zi=~0, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m_beetle_t_d_s 4 433.45 450.52 -212.72 425.45
## m_beetle_tXd_s 6 436.30 461.91 -212.15 424.30 1.1483 2 0.5632
So, there is no strong evidence for a varying effect of temperature among days.
What about variation in the effect of temperature among sites?
m_beetle_tXs_d <- glmmTMB(data = beetles,
formula = presence ~ 1 + temperature + (1 | day) +
                             (1 + temperature | site), family = "binomial")## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
summary(m_beetle_tXs_d)## Family: binomial ( logit )
## Formula:
## presence ~ 1 + temperature + (1 | day) + (1 + temperature | site)
## Data: beetles
##
## AIC BIC logLik deviance df.resid
## NA NA NA NA 522
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev. Corr
## day (Intercept) 8.470e+00 2.910e+00
## site (Intercept) 3.832e-10 1.957e-05
## temperature 1.251e-11 3.538e-06 0.48
## Number of obs: 528, groups: day, 22; site, 24
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.88933 0.79076 -3.654 0.000258 ***
## temperature 0.27667 0.03523 7.854 4.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model has difficulty converging, so we cannot safely compare it to another model. However, the reason the model does not converge is itself our answer: the variance parameters are effectively 0, so there is no evidence for a varying effect of temperature among sites.
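One common remedy when a random-slope term fails to converge is to simplify its covariance structure, for example forcing the intercept-slope correlation to zero with glmmTMB's diag() structure (a sketch, assuming the beetles data above; here it would most likely just confirm the near-zero variances):

```r
library(glmmTMB)

# Random slope for temperature among sites, but no intercept-slope correlation
m_beetle_tXs_d_diag <- glmmTMB(
  data = beetles,
  formula = presence ~ 1 + temperature + (1 | day) +
    diag(1 + temperature | site),
  family = "binomial")
summary(m_beetle_tXs_d_diag)
```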
Oh, and by the way, we should check the fit of our models. I have been ignoring that aspect a little to focus on other content, but in real work we should do it:
testResiduals(m_beetle_tXd_s)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.039023, p-value = 0.3973
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.99341, p-value = 0.72
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 0, simulations = 528, p-value = 0.02557
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.000000000 0.006962165
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0
Again, the outlier test is formally significant, but since we observe zero outliers (fewer than expected under simulation), nothing looks too scary.
We study sexual selection in mosquitofish. We want to test whether the size of the male gonopodium (a modified fin they use in display and mating) predicts reproductive success. We have studied fish in 14 different tanks and we want to account for possible variation among tanks, so we will need a random effect for tank. The response variable, reproductive success, is a count, so we will probably use a count-data GLM. GLM + random effect = GLMM.
fish <- read_csv("fishrepro.csv")## Parsed with column specification:
## cols(
## id = col_double(),
## reproductive_success = col_double(),
## gonopodium_size = col_double(),
## tank_id = col_double()
## )
summary(fish)## id reproductive_success gonopodium_size tank_id
## Min. : 1.0 Min. : 0.0000 Min. : 7.190 Min. : 1.000
## 1st Qu.: 355.8 1st Qu.: 0.0000 1st Qu.: 9.349 1st Qu.: 4.000
## Median : 710.5 Median : 0.0000 Median :10.015 Median : 7.000
## Mean : 710.5 Mean : 0.7753 Mean :10.016 Mean : 7.387
## 3rd Qu.:1065.2 3rd Qu.: 1.0000 3rd Qu.:10.652 3rd Qu.:11.000
## Max. :1420.0 Max. :45.0000 Max. :13.390 Max. :14.000
ggplot(fish, aes(x=reproductive_success))+geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(fish, aes(x=gonopodium_size, y=reproductive_success)) + geom_point()
tibble(tabyl(dat = fish, var1 = id))## # A tibble: 1,420 x 3
## id n percent
## <dbl> <dbl> <dbl>
## 1 1 1 0.000704
## 2 2 1 0.000704
## 3 3 1 0.000704
## 4 4 1 0.000704
## 5 5 1 0.000704
## 6 6 1 0.000704
## 7 7 1 0.000704
## 8 8 1 0.000704
## 9 9 1 0.000704
## 10 10 1 0.000704
## # … with 1,410 more rows
The variable id is just the row number. It does not seem particularly useful for now... but wait for the twist.
model_poisson <- glmmTMB(reproductive_success ~ 1 + gonopodium_size + (1|tank_id), data = fish, family = poisson)
summary(model_poisson)## Family: poisson ( log )
## Formula: reproductive_success ~ 1 + gonopodium_size + (1 | tank_id)
## Data: fish
##
## AIC BIC logLik deviance df.resid
## 3655.5 3671.3 -1824.8 3649.5 1417
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## tank_id (Intercept) 0.01442 0.1201
## Number of obs: 1420, groups: tank_id, 14
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4969 0.3234 -23.18 <2e-16 ***
## gonopodium_size 0.6983 0.0299 23.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(model_poisson)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.086981, p-value = 9.324e-10
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.9641, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 23, simulations = 1420, p-value = 0.002321
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.01029461 0.02420511
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.01619718
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = red]
repro; size
node [shape = circle, color = black]
lambda; y; noise; baselines;
node [shape = diamond, color=green]
intercept;
y -> lambda [label = ' exp'];
lambda -> repro [label = ' Poisson', style=dashed];
intercept -> baselines;
baselines ->y; size -> slope -> y ; noise -> baselines;
V_tank -> noise [label = ' Gaussian', style=dashed];
}")
The Poisson model is clearly overdispersed (the DHARMa dispersion test is highly significant), so let's try a negative binomial family instead:
model_nbinom1 <- glmmTMB(reproductive_success ~ 1 + gonopodium_size + (1|tank_id), data = fish, family = nbinom1())
summary(model_nbinom1)## Family: nbinom1 ( log )
## Formula: reproductive_success ~ 1 + gonopodium_size + (1 | tank_id)
## Data: fish
##
## AIC BIC logLik deviance df.resid
## 3212 3233 -1602 3204 1416
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## tank_id (Intercept) 4.613e-09 6.792e-05
## Number of obs: 1420, groups: tank_id, 14
##
## Overdispersion parameter for nbinom1 family (): 1.25
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.15651 0.43928 -14.02 <2e-16 ***
## gonopodium_size 0.57295 0.04126 13.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(model_nbinom1)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.021229, p-value = 0.5442
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.3377, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 5, simulations = 1420, p-value = 0.07025
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.001144258 0.008197856
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.003521127
The nbinom1 family (variance proportional to the mean) improves the fit, but the dispersion test still flags a problem; we can also try nbinom2, where the variance grows quadratically with the mean:
model_nbinom2 <- glmmTMB(reproductive_success ~ 1 + gonopodium_size + (1|tank_id), data = fish, family = nbinom2())
summary(model_nbinom2)## Family: nbinom2 ( log )
## Formula: reproductive_success ~ 1 + gonopodium_size + (1 | tank_id)
## Data: fish
##
## AIC BIC logLik deviance df.resid
## 3174.8 3195.8 -1583.4 3166.8 1416
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## tank_id (Intercept) 2.021e-09 4.496e-05
## Number of obs: 1420, groups: tank_id, 14
##
## Overdispersion parameter for nbinom2 family (): 0.762
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.33988 0.49339 -14.88 <2e-16 ***
## gonopodium_size 0.68396 0.04749 14.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(model_nbinom2)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.01608, p-value = 0.8563
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.2224, p-value = 0.016
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 6, simulations = 1420, p-value = 0.1333
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.001552160 0.009173968
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.004225352
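Having fitted several candidate families, we can also line up their AICs directly (a sketch, assuming the models above are in the workspace):

```r
# Lower AIC indicates a better fit/complexity trade-off
AIC(model_poisson, model_nbinom1, model_nbinom2)
```

From the summaries above, nbinom2 (AIC 3174.8) beats nbinom1 (3212) and Poisson (3655.5).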
Instead of using a different family, we can try to model over-dispersion with an observation-level random effect:
model_poisson_re <- glmmTMB(reproductive_success ~ 1 + gonopodium_size + (1|id)+ (1|tank_id), data = fish, family = poisson)
summary(model_poisson_re)## Family: poisson ( log )
## Formula: reproductive_success ~ 1 + gonopodium_size + (1 | id) + (1 |
## tank_id)
## Data: fish
##
## AIC BIC logLik deviance df.resid
## 3133.5 3154.6 -1562.8 3125.5 1416
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## id (Intercept) 1.134e+00 1.065e+00
## tank_id (Intercept) 2.499e-10 1.581e-05
## Number of obs: 1420, groups: id, 1420; tank_id, 14
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.81435 0.52769 -14.81 <2e-16 ***
## gonopodium_size 0.67597 0.05006 13.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
testResiduals(model_poisson_re)## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.012294, p-value = 0.9827
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.0481, p-value = 0.52
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa outlier test based on exact binomial test
##
## data: simulationOutput
## outliers at both margin(s) = 5, simulations = 1420, p-value = 0.07025
## alternative hypothesis: true probability of success is not equal to 0.007968127
## 95 percent confidence interval:
## 0.001144258 0.008197856
## sample estimates:
## frequency of outliers (expected: 0.00796812749003984 )
## 0.003521127
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
node [shape = square, color = red]
repro; size
node [shape = circle, color = black]
lambda; y; noise; baselines;
node [shape = diamond, color=green]
intercept;
y -> lambda [label = ' exp'];
lambda -> repro [label = ' Poisson', style=dashed];
intercept -> baselines;
baselines ->y; size -> slope -> y ; noise -> baselines;
V_tank -> noise [label = ' Gaussian', style=dashed];
V_id -> noise [label = ' Gaussian', style=dashed];
}")
This model says something like: "each observation results from the effect of the predictor, plus some unexplained noise; that noise is additive on the log scale, which means the unknown effects are multiplicative on the data scale". It does not always work, but it is often a reasonable model for biological data.
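A minimal self-contained simulation (not the fish data) of the process this diagram describes, Poisson counts whose rate carries log-normal observation-level noise, shows how noise that is multiplicative on the data scale produces overdispersion:

```r
set.seed(1)
n <- 10000
# Log-scale linear predictor: intercept log(2) plus observation-level Gaussian noise
eta <- log(2) + rnorm(n, mean = 0, sd = 1)
y <- rpois(n, lambda = exp(eta))

mean(y)           # above 2: Jensen's inequality again
var(y) / mean(y)  # far above 1: overdispersed relative to a plain Poisson
```

A plain Poisson has variance equal to its mean; the extra variance here comes entirely from the log-normal noise in the rate.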